Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. We want to use the data behind this article to find out in which countries people drink the most, by looking at the consumption of beer, wine, and spirits.
library(fivethirtyeight)
data(drinks)
# or download directly
alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")
Let’s take a look at our data to see which variables we have and whether there are any missing values.
skim(alcohol_direct)
| Name | alcohol_direct |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
The dataset contains one character (qualitative) variable, country, and four numeric variables (beer_servings, spirit_servings, wine_servings and total_litres_of_pure_alcohol). The summary statistics for the numeric variables show that the data is right-skewed. There are no missing values, as n_missing is 0 for all five variables in the dataset.
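One quick way to read the skew off the skim table: for right-skewed data the mean sits above the median. A sanity check using the numbers copied from the output above:

```r
# Mean > median (p50) for every numeric variable is a classic sign of right
# skew; the values below are copied from the skim() output above
means   <- c(beer = 106.16, spirit = 80.99, wine = 49.45, total_litres = 4.72)
medians <- c(beer = 76.0,   spirit = 56.0,  wine = 8.0,   total_litres = 4.2)
all(means > medians)  # TRUE
```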
We use ggplot to graph the top 25 consuming countries for beer, wine and spirits respectively.
alcohol_direct %>%
slice_max(order_by = beer_servings,
n = 25) %>% #pick only the top 25 countries with the most beer_servings
ggplot(aes(x = beer_servings,
y = fct_reorder(country, beer_servings))) + #order the country with beer_servings
geom_col(fill = "gold",
color = "grey") +
labs(title="Top 25 beer consuming countries",
x = "Number of Beer Servings",
y = "Country") +
coord_cartesian(xlim = c(100,NA)) +
  NULL
alcohol_direct %>%
slice_max(order_by = wine_servings,
n = 25) %>%
ggplot(aes(x = wine_servings,
y = fct_reorder(country, wine_servings))) +
geom_col(fill = "deeppink4", #choose the most wine-like color in the list of color table
color = "grey") +
labs(title = "Top 25 wine consuming countries",
x = "Number of Wine Servings",
y = "Country") +
coord_cartesian(xlim = c(100,NA)) +
  NULL
alcohol_direct %>%
slice_max(order_by = spirit_servings,
n = 25) %>%
ggplot(aes(x = spirit_servings,
y = fct_reorder(country,
spirit_servings))) +
geom_col(fill = "cornsilk3", color = "grey") +
labs(title = "Top 25 spirits consuming countries",
x = "Number of Spirit Servings",
y = "Country") +
coord_cartesian(xlim = c(100,NA)) +
  NULL
We can infer from the above graphs…
On a high-level overview we observe different countries being in the top spots for each type of alcohol.
Beginning with beer consumption, the top 5 spots are dominated by central and eastern European countries, with a few “surprises”. Namibia apparently has a long history of beer production: according to CNN, the first Namibian brewery opened in 1900, and every ethnic group in Africa has its own methods for creating the famous beverage. Nowadays local beer is a tourist attraction and a large source of revenue for both Namibia and Ghana, explaining their placement on the list. As for the rest of the top 5, beer is known to be very inexpensive there, with the Czech Republic offering beer at a lower price than water in some pubs and restaurants.
Moving to wine, the top spot going to France is (arguably) highly anticipated. Bordeaux, a rural city in France, has some of the world’s finest wineries, and its wine is beloved both locally and internationally. Similar reasoning applies to Portugal: Douro Port is the country’s most famous wine, with many casual drinkers switching to premium wines, according to wininteligence.com. Andorra’s placement on the list is definitely worth commenting on, because its small population ranks it among the six European countries with the fewest residents. As Andorra is a famous tourist destination because of its ski resorts, its high wine consumption is mostly attributable to tourists.
Finally, regarding spirits we get mixed signals about the top 5. Although we would certainly expect eastern countries such as Russia, the motherland of vodka, to rank high on the list, Grenada comes first, with another Caribbean country, St. Lucia, following in fifth. According to jamaicaobserver.com, both countries have had problems with illegal consumption since 2014, when the WHO declared Grenada the country with the highest per-capita alcohol consumption in the Caribbean.
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset.
movies <- read.csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <int> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <int> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <int> 760505847, 658672302, 652177271, 623279547, 533316…
## $ budget <int> 237000000, 200000000, 150000000, 220000000, 185000…
## $ cast_facebook_likes <int> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <int> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <int> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
Let’s see if there are any missing values in the dataset.
skim(movies) #no missing values (n_missing is 0 for all variables)
| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
Looking at the output above, the n_missing column is 0 for all variables, meaning that there are no missing values.
Then we want to find out if there are duplicate entries in the dataset.
#filter for duplicates
movies %>%
filter(duplicated(title)) %>%
  glimpse()
## Rows: 54
## Columns: 11
## $ title <chr> "Spider-Man 3", "Alice in Wonderland", "Oz the Gre…
## $ genre <chr> "Action", "Adventure", "Adventure", "Drama", "Dram…
## $ director <chr> "Sam Raimi", "Tim Burton", "Sam Raimi", "Kenneth B…
## $ year <int> 2007, 2010, 2013, 2015, 2008, 2014, 2001, 2015, 20…
## $ duration <int> 156, 108, 130, 105, 122, 101, 119, 94, 94, 125, 10…
## $ gross <int> 336530303, 334185206, 234903076, 201148159, 191449…
## $ budget <int> 258000000, 200000000, 215000000, 95000000, 3700000…
## $ cast_facebook_likes <int> 46055, 79957, 73441, 4671, 44060, 2690, 2776, 1788…
## $ votes <int> 383071, 306336, 175413, 103749, 348007, 167089, 17…
## $ reviews <int> 2294, 1187, 1036, 666, 1885, 839, 1598, 379, 379, …
## $ rating <dbl> 6.2, 6.5, 6.4, 7.0, 5.2, 5.9, 5.7, 6.7, 6.7, 6.8, …
The output shows that we have 54 duplicates based on movie title. After checking the duplicates (see example below), we conclude that, while the duplicated entries are not 100% identical, they differ only in the number of votes and reviews, and only to a very small extent. We therefore remove the 54 duplicates identified above from our dataset as a next step.
#check in which way duplicate observations differ, using Cinderella as an example
movies %>%
  filter(title=="Cinderella")
## title genre director year duration gross budget
## 1 Cinderella Drama Kenneth Branagh 2015 105 201148159 95000000
## 2 Cinderella Drama Kenneth Branagh 2015 105 201148159 95000000
## cast_facebook_likes votes reviews rating
## 1 4671 103737 665 7
## 2 4671 103749 666 7
Let’s count movies by genre and see which genre has the most entries in our dataset. Our guess would be comedy, since it is the most accessible form of entertainment.
movies <- movies %>%
distinct(title,
.keep_all = TRUE) #Removing the 54 duplicates
movies %>%
count(genre,
        sort = TRUE) #Count by genre and sort in descending order
## genre n
## 1 Comedy 844
## 2 Action 719
## 3 Drama 484
## 4 Adventure 281
## 5 Crime 198
## 6 Biography 135
## 7 Horror 128
## 8 Animation 35
## 9 Fantasy 26
## 10 Documentary 25
## 11 Mystery 15
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
We now produce a table of the average gross earnings and budget (gross and budget) by genre, and calculate the return on budget for each genre.
Our approach is to first create a variable named return_on_budget for every movie and then average it within each genre. This better represents how profitable a movie in each genre is on average. An alternative approach would be dividing average earnings by average budget, but that result would be skewed towards movies with bigger budgets, which would be somewhat misleading.
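The difference between the two approaches is easy to see on made-up numbers (not from the dataset): a small-budget hit and a big-budget film.

```r
# Made-up numbers: a small-budget hit (10x return) and a big-budget film (2x)
gross  <- c(10e6, 400e6)
budget <- c(1e6, 200e6)
mean(gross / budget)        # average of per-movie returns: (10 + 2) / 2 = 6
mean(gross) / mean(budget)  # ratio of averages: about 2.04, dominated by the big film
```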
#Producing a table of each genre's average gross, budget and return
movies %>%
mutate(return_on_budget = gross/budget) %>%
group_by(genre) %>%
summarise(gross_avg = mean(gross),
budget_avg = mean(budget),
return_on_budget_avg = mean(return_on_budget)) %>%
  arrange(desc(return_on_budget_avg))
## # A tibble: 17 × 4
## genre gross_avg budget_avg return_on_budget_avg
## <chr> <dbl> <dbl> <dbl>
## 1 Horror 37782310. 13804379. 86.1
## 2 Biography 45201805. 28543696. 22.3
## 3 Musical 92084000 3189500 18.8
## 4 Family 149160478. 14833333. 14.1
## 5 Documentary 17353973. 5887852. 8.70
## 6 Western 20821884 3465000 7.06
## 7 Fantasy 41902674. 18484615. 6.10
## 8 Animation 98433792. 61701429. 5.01
## 9 Comedy 42487808. 24458506. 3.70
## 10 Romance 31264848. 25107500 3.17
## 11 Drama 36754959. 25832605. 2.98
## 12 Mystery 69117136. 41500000 2.90
## 13 Adventure 94350236. 64692313. 2.44
## 14 Crime 37601525. 26527405. 2.19
## 15 Action 86270343. 70774558. 1.93
## 16 Sci-Fi 29788371. 27607143. 1.58
## 17 Thriller 2468 300000 0.00823
Horror movies, biographies and musicals are the top 3 in terms of profitability.
Let us also find out which directors create the highest revenue-generating movies.
#Produce a table of directors regarding their gross
movies %>%
group_by(director) %>%
summarise(gross_total = sum(gross),
gross_mean = mean(gross),
gross_median = median(gross),
gross_sd = sd(gross)) %>%
slice_max(order_by = gross_total,
            n = 15)
## # A tibble: 15 × 5
## director gross_total gross_mean gross_median gross_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051.
## 2 Michael Bay 2195443511 182953626. 168468240. 125789167.
## 3 James Cameron 1909725910 318287652. 175562880. 309171337.
## 4 Christopher Nolan 1813227576 226653447 196667606. 187224133.
## 5 George Lucas 1741418480 348283696 380262555 146193880.
## 6 Robert Zemeckis 1619309108 124562239. 100853835 91300279.
## 7 Tim Burton 1557078534 111219895. 69791834 99304293.
## 8 Sam Raimi 1443167519 180395940. 138480208 174705230.
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408.
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020.
## 11 Ron Howard 1335988092 111332341 101587923 81933761.
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822.
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765.
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773.
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285.
Apparently Steven Spielberg leads the ranking by total gross; however, he is not among the directors with the highest mean revenue (gross_mean) or median revenue (gross_median), or the lowest standard deviation of revenues (gross_sd).
How about ratings? Let’s see how each genre is regarded by its viewers.
#Produce a table showing each genre's rating information
movies %>%
group_by(genre) %>%
summarise(rating_mean = mean(rating),
rating_min = min(rating),
rating_max = max(rating),
rating_median = median(rating),
rating_sd = sd(rating)) %>%
  arrange(desc(rating_mean))
## # A tibble: 17 × 6
## genre rating_mean rating_min rating_max rating_median rating_sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Biography 7.11 4.5 8.9 7.2 0.760
## 2 Crime 6.92 4.8 9.3 6.9 0.853
## 3 Mystery 6.84 4.6 8.5 6.7 0.910
## 4 Musical 6.75 6.3 7.2 6.75 0.636
## 5 Drama 6.74 2.1 8.8 6.8 0.915
## 6 Documentary 6.66 1.6 8.5 7.4 1.77
## 7 Sci-Fi 6.66 5 8.2 6.4 1.09
## 8 Animation 6.65 4.5 8 6.9 0.968
## 9 Romance 6.65 6.2 7.1 6.65 0.636
## 10 Adventure 6.51 2.3 8.6 6.6 1.11
## 11 Family 6.5 5.7 7.9 5.9 1.22
## 12 Action 6.23 2.1 9 6.3 1.04
## 13 Comedy 6.11 1.9 8.8 6.2 1.02
## 14 Fantasy 6.08 4.3 7.9 6.2 0.953
## 15 Horror 5.79 3.6 8.5 5.85 0.987
## 16 Western 5.7 4.1 7.3 5.7 2.26
## 17 Thriller 4.8 4.8 4.8 4.8 NA
# Graph the distribution of ratings for all movies
ggplot(movies, aes(x=rating)) +
geom_histogram(bins = 20,
color = "grey",
fill = "lightblue") +
labs(title = "Distribution of ratings for all movies",
y = "Number of Movies",
x = "Rating Score") +
  NULL
#Graph the distribution of ratings per genre (Thriller only includes 1 movie, so no graph shown)
ggplot(movies,
aes(x = rating)) +
#Density plot to make comparison among genres easier, considering difference in sample sizes
geom_density(color = "grey",
fill = "lightblue") +
facet_wrap(~ genre) +
labs(title = "Distribution of ratings per genre",
y = NULL,
x = "Rating Score") +
  NULL
#check why no curve for genre Thriller
movies %>%
  filter(genre=="Thriller")
## title genre director year duration gross budget
## 1 Locker 13 Thriller Bruce Dellis 2014 95 2468 300000
## cast_facebook_likes votes reviews rating
## 1 2048 241 15 4.8
We only have 1 movie categorized as thriller in the dataset, so there is no density plot for thriller. Genres like family, western, romance and musical have multimodal distributions because they only have single-digit sample sizes. Leaving these multimodal genres aside, the best-regarded genres are biography, crime and documentary: their rating scores are the most concentrated around the peak of the distribution, and their peaks sit at the highest scores. The box plot below indicates the same.
#Box plot to make the comparison again, telling the same conclusion (biography wins) in a more quantitative perspective
ggplot(movies,
aes(x = rating)) +
geom_boxplot(color = "grey",
fill = "lightblue") +
facet_wrap(~ genre) +
labs(title = "Distribution of ratings per genre",
y = NULL,
x = "Rating Score") +
  NULL
The importance of social media nowadays could mean that the number of Facebook likes a movie’s cast has is a good predictor of how much money the movie will make at the box office. We therefore examine the relationship between gross and cast_facebook_likes to find out whether this is true.
#Create a scatterplot and a trend line
ggplot(movies,
aes(x = cast_facebook_likes,
y = gross))+
geom_point(alpha = 0.5) +
geom_smooth() +
#Log scale presents a more significant correlation
#Don't want to use scientific notation here
scale_y_log10(labels = scales::comma) +
scale_x_log10(labels = scales::comma) +
labs (x = "Number of Facebook Likes for Cast",
y = "Gross Revenue of Movie",
title = "Relationship between facebook likes of cast and gross of movie") +
  NULL
#Calculate correlation coefficient to understand strength of relationship
movies %>%
select(cast_facebook_likes,gross) %>%
  cor()
## cast_facebook_likes gross
## cast_facebook_likes 1.000 0.208
## gross 0.208 1.000
Looking at the scatterplot on a logarithmic scale, as well as the correlation coefficient, there is a positive relationship between the number of Facebook likes a movie’s cast has and the revenue the movie makes at the box office. This can be taken as a sign that a more popular or more social-media-active cast generates higher revenues. However, with a correlation coefficient of 0.208, this relationship is not very strong.
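The code comment about the log scale can be made concrete: when growth is multiplicative, the correlation of the log-transformed values differs noticeably from the raw Pearson coefficient. A toy illustration (made-up numbers, not the movie data):

```r
x <- 1:5
y <- exp(x)      # multiplicative growth
cor(x, y)        # raw-scale correlation, roughly 0.89
cor(x, log(y))   # exactly 1, since log(y) = x
```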
We also examine the relationship between gross and budget to see whether the budget is likely to be a better predictor of how much money a movie will make at the box office.
#Create a scatterplot with a trend line
ggplot(movies,
aes(x = budget,
y = gross))+
geom_point() +
scale_y_log10(labels = scales::comma) +
scale_x_log10(labels = scales::comma) +
geom_smooth() +
labs (title = "Relationship between budget and revenue of movie",
x = "Budget of Movie",
y = "Gross Revenue of Movie") +
  NULL
#Calculate the correlation coefficient to understand strength of relationship
movies %>%
select(budget,gross) %>%
  cor()
## budget gross
## budget 1.000 0.639
## gross 0.639 1.000
There is a clear positive relationship between the budget a movie has and the revenues it achieves at the box office (correlation coefficient of 0.639). Movies with a higher budget tend to be the ones that make more revenue. We can infer that budget is a better predictor of gross revenues than the cast’s number of Facebook likes.
The third possible predictor of gross revenues of a movie could be its IMDB rating. Let’s examine this relationship with the same method and facet it by genre.
#Create a scatterplot with a trend line per genre
ggplot(movies,
aes(x = rating,
y = gross))+
geom_point() +
geom_smooth() +
facet_wrap(~genre) +
scale_y_continuous(labels = scales::comma) +
labs (title = "Relationship between the rating of a movie and its gross revenue",
x = "Rating of Movie",
y = "Gross Revenue of Movie") +
  NULL
#Calculate correlation coefficient to understand strength of relationship (for all genres together)
movies %>%
select(gross,rating) %>%
  cor()
## gross rating
## gross 1.000 0.274
## rating 0.274 1.000
Looking at the scatterplots above, a positive relationship between a movie’s rating and its gross revenue can be inferred, especially for the genres action, adventure, comedy, crime, drama and horror. As the smooth line gets steeper at higher ratings for these genres, an increase in rating for movies that are already highly rated brings a larger increase in revenues than a rating increase for movies rated low. However, for some genres, such as animation or documentary, there is no clear relationship, which hints that people rely on ratings more for some genres than for others when deciding whether to watch a movie.
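Per-genre correlation coefficients would back up this faceted reading (with dplyr: group_by(genre) then summarise a cor(rating, gross) column). A base-R sketch on toy data, so the mechanics are visible without the real dataset:

```r
# Toy data (not the movie dataset): one group with a perfect linear
# relationship, one with essentially none
toy <- data.frame(genre  = rep(c("Action", "Documentary"), each = 4),
                  rating = c(5, 6, 7, 8,   5, 6, 7, 8),
                  gross  = c(10, 20, 30, 40,   25, 10, 30, 15))
sapply(split(toy, toy$genre), function(d) cor(d$rating, d$gross))
```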
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO year, and the sector and industry the company is in.
nyse <- read_csv(here::here("data","nyse.csv"))
We create a table and a bar plot that show the number of companies per sector, in descending order.
company_sector_sum <- nyse %>% #create dataframe that counts # of companies per sector
group_by(sector) %>%
count(sort = TRUE)
company_sector_sum #show the table
## # A tibble: 12 × 2
## # Groups: sector [12]
## sector n
## <chr> <int>
## 1 Finance 97
## 2 Consumer Services 79
## 3 Public Utilities 60
## 4 Capital Goods 45
## 5 Health Care 45
## 6 Energy 42
## 7 Technology 40
## 8 Basic Industries 39
## 9 Consumer Non-Durables 31
## 10 Miscellaneous 12
## 11 Transportation 10
## 12 Consumer Durables 8
ggplot(company_sector_sum,
aes(x = reorder(sector, -n),
y = n)) + #plot output from above
theme(axis.text.x = element_text(angle = 45, # Tilt the x-label at an angle of 45
hjust = 1)) +
geom_bar(fill = "lightblue",
color = "grey",
stat = "identity") +
labs(title = "Number of companies per sector",
x = "Sector",
y = "Number of Companies") +
  NULL
Next, let’s choose the Dow Jones Industrial Average (DJIA) stocks and their ticker symbols and download some data. Besides the thirty stocks that make up the DJIA, we will also add SPY, an S&P 500 ETF.
djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
#get tables that exist on URL
tables <- djia_url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called djia.
# Use purr::map() to create a list of all tables in URL
djia <- map(tables, . %>%
html_table(fill=TRUE)%>%
clean_names())
# constituents
table1 <- djia[[2]] %>% # the second table on the page contains the ticker symbols
mutate(date_added = ymd(date_added),
# if a stock is listed on NYSE, its symbol is, e.g., NYSE: MMM
# We will get prices from yahoo finance which requires just the ticker
# if symbol contains "NYSE*", the * being a wildcard
  # then we just drop the first 6 characters in that string
ticker = ifelse(str_detect(symbol, "NYSE*"),
str_sub(symbol,7,11),
symbol)
)
# we need a vector of strings with just the 30 tickers + SPY
tickers <- table1 %>%
select(ticker) %>%
pull() %>% # pull() gets them as a sting of characters
  c("SPY") # and lets us add SPY, the S&P 500 ETF
Now let us download prices for all 30 DJIA constituents and the SPY ETF that tracks the S&P 500, since January 1, 2000.
# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- tickers %>%
tq_get(get = "stock.prices",
from = "2000-01-01",
         to = Sys.Date()) %>% # Sys.Date() returns today's date
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 161,491
## Columns: 8
## Groups: symbol [31]
## $ symbol <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM"…
## $ date <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07, …
## $ open <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, 4…
## $ high <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, 4…
## $ low <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, 4…
## $ close <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, 4…
## $ volume <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 2357600…
## $ adjusted <dbl> 27.2, 26.1, 26.9, 29.0, 29.6, 29.4, 28.9, 29.0, 29.0, 28.6, 2…
Given the adjusted closing prices, our first step is to calculate daily and monthly returns.
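Note that the daily returns below are computed with type = "log" while the monthly and yearly ones use type = "arithmetic"; the two conventions are related by r_log = log(1 + r_arith), as a toy price series confirms:

```r
p <- c(100, 110, 99)               # made-up prices
r_arith <- diff(p) / head(p, -1)   # arithmetic returns: 0.10, -0.10
r_log   <- diff(log(p))            # log returns
all(abs(r_log - log(1 + r_arith)) < 1e-12)  # TRUE
```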
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
               cols = c(nested.col))
We then summarise the monthly returns for each of the stocks and SPY, including their min, max, median, mean and standard deviation.
#Create the summary dataset
myStocks_monthly_returns <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(Mean = mean(monthly_returns),
Minimum = min(monthly_returns),
Maximum = max(monthly_returns),
Median = median(monthly_returns),
StandardDeviation = sd(monthly_returns)) %>%
arrange(desc(Mean))
#Display the summary in table
myStocks_monthly_returns
## # A tibble: 31 × 6
## symbol Mean Minimum Maximum Median StandardDeviation
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.0269 -0.577 0.454 0.0338 0.115
## 2 CRM 0.0263 -0.360 0.403 0.0199 0.111
## 3 V 0.0200 -0.196 0.338 0.0255 0.0661
## 4 UNH 0.0191 -0.306 0.266 0.0219 0.0698
## 5 NKE 0.0162 -0.375 0.396 0.0162 0.0756
## 6 DOW 0.0149 -0.276 0.255 0.0367 0.111
## 7 CAT 0.0143 -0.353 0.350 0.0133 0.0899
## 8 BA 0.0127 -0.458 0.459 0.0167 0.0926
## 9 MSFT 0.0114 -0.344 0.408 0.0172 0.0819
## 10 GS 0.0110 -0.275 0.312 0.0163 0.0925
## # … with 21 more rows
We then create a density plot for each of the stocks to see how their monthly returns are distributed.
#Create a density plot of monthly returns
ggplot(myStocks_returns_monthly,
aes(x = monthly_returns)) +
geom_density() +
facet_wrap(~symbol) +
scale_x_continuous(labels = scales::percent_format(accuracy=1))+ #x-axis in percentage
labs(title="Distribution of monthly returns of selected stocks",
x = "Monthly Return",
y="Density")+
  NULL
We can infer from the above table and plot…
The plot shows that most return distributions are non-normal, exhibiting a high level of kurtosis. It also shows that all of the companies have positive average returns, although some are only marginally positive. AXP has the highest maximum monthly return, and AAPL has the lowest minimum.
The riskiest stock, surprisingly, is Apple: it has the highest standard deviation (and lowest kurtosis) and the lowest minimum return among the DJIA constituents and the S&P 500 tracker. This is somewhat counterintuitive because Apple also has the highest mean return of the selection, but as risk in finance is measured by volatility, we conclude that Apple is the riskiest.
We believe that, in line with financial theory, SPY is the least risky asset among those examined. Its distribution is heavily concentrated around a 0% return, as demonstrated by the highest single-point density of the group, and it has the lowest standard deviation, which is again expected given the diversified nature of an index product. Thus, we conclude that SPY is the least risky of the selection.
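To quantify the kurtosis reading above, a minimal helper can be used (our own sketch, not a tidyquant function; it mixes the population fourth moment with the sample sd, so it is only a rough estimator):

```r
# Excess kurtosis: positive values mean fatter tails than a normal distribution
excess_kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3
# e.g. apply it per symbol to the monthly_returns column of myStocks_returns_monthly
```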
Finally, we want to make a scatter plot that shows the expected monthly return and risk of a stock.
#Create a scatterplot with mean and sd of monthly returns
ggplot(myStocks_monthly_returns,
aes(x = StandardDeviation,
y = Mean)) +
geom_point()+
geom_smooth(method = "lm",
se=FALSE)+
ggrepel::geom_text_repel(aes(label = symbol)) +
scale_x_continuous(labels = scales::percent_format(accuracy = 0.1))+
scale_y_continuous(labels = scales::percent_format(accuracy = 0.1))+
labs(title="Expected monthly return of stocks and their risk",
x = "Standard Deviation",
y="Expected Monthly Return")+
  NULL
What can we infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
We can infer from the above plot…
The plot shows that, in general, the higher the mean return, the higher the risk (standard deviation). This is as expected: financial theory implies that to earn higher returns, investors must accept greater volatility.
Several stocks sit below the line of best fit and therefore offer returns that do not adequately compensate for the greater risk undertaken: DOW, CSCO and INTC. The worst of these is CSCO, which offers a very low mean return but the fourth-highest volatility of the group; this would be a very bad investment. A few stocks outperform the expected mean-volatility trade-off, such as V and UNH, offering greater returns with lower-than-expected volatility.
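The “below the line of best fit” reading can be made precise with the residuals of the same linear fit the plot draws. A sketch on made-up numbers (the symbols A to D are hypothetical, not DJIA tickers):

```r
# A stock sits below the line when its residual from the mean-vs-sd
# regression is negative, i.e. its return under-compensates its risk
df <- data.frame(symbol = c("A", "B", "C", "D"),
                 sd     = c(0.05, 0.08, 0.10, 0.12),
                 mean   = c(0.010, 0.016, 0.012, 0.024))
fit <- lm(mean ~ sd, data = df)
df$symbol[residuals(fit) < 0]  # here only "C" falls below the fitted line
```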
A recent study by the Bank for International Settlements (BIS) claimed that the Current Inflation Spike Is Just Transitory. As the article says,
The surge in inflation seen across major economies is probably short lived because it’s confined to just a few sectors of the economy, according to the Bank for International Settlements.
New research by the BIS’s Claudio Borio, Piti Disyatat, Egon Zakrajsek and Dora Xia adds to one of the hottest debates in economics – how long the current surge in consumer prices will last. Both Federal Reserve Chair Jerome Powell and his euro-area counterpart Christine Lagarde have said the pickup is probably transitory, despite a snarled global supply chain and a spike in energy prices.
To better understand inflation, we want to use CPI and 10-year yield to produce the following graph:
Our two variables are:
cpi <- tq_get("CPIAUCSL", get = "economic.data",
from = "1980-01-01") %>%
rename(cpi = symbol, # FRED data is given as 'symbol' and 'price'
rate = price) %>% # we rename them to what they really are, e.g., cpi and rate
# calculate yearly change in CPI by dividing current month by same month a year (or 12 months) earlier, minus 1
mutate(cpi_yoy_change = rate/lag(rate, 12) - 1)
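The year-over-year formula above can be sanity-checked on a made-up index that grows 2% every month (using a base-R stand-in for dplyr::lag):

```r
rate  <- 100 * 1.02^(0:24)                # hypothetical index, +2% per month
lag12 <- c(rep(NA, 12), head(rate, -12))  # base-R equivalent of dplyr::lag(rate, 12)
yoy   <- rate / lag12 - 1
all(abs(na.omit(yoy) - (1.02^12 - 1)) < 1e-9)  # TRUE: every YoY change is 1.02^12 - 1
```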
ten_year_monthly <- tq_get("GS10", get = "economic.data",
from = "1980-01-01") %>%
rename(ten_year = symbol,
yield = price) %>%
mutate(yield = yield / 100) # original data is not given as, e.g., 0.05, but rather 5, for five percent
# we have the two dataframes-- we now need to join them, and we will use left_join()
# base R has a function merge() that does the same, but it's slow, so please don't use it
mydata <-
cpi %>%
left_join(ten_year_monthly, by="date") %>%
mutate(
year = year(date), # using lubridate::year() to generate a new column with just the year
month = month(date, label = TRUE),
decade=case_when(
year %in% 1980:1989 ~ "1980s",
year %in% 1990:1999 ~ "1990s",
year %in% 2000:2009 ~ "2000s",
year %in% 2010:2019 ~ "2010s",
TRUE ~ "2020s"
)
  )
Now let’s reproduce the graph.
ggplot(mydata,
aes(x = cpi_yoy_change,
y = yield,
color = decade)) +
geom_point() +
geom_smooth(method = 'lm', #create a trendline without confidence interval
se = FALSE) +
facet_wrap(~ decade, #faceted by decade
nrow = 5,
scales = "free")+
ggrepel::geom_text_repel(aes(label = format(date, #geom_text_repel to avoid overlap
format = "%b %Y")),
size = 2,
segment.color="transparent")+
scale_x_continuous(labels = scales::percent)+ #x-axis as percentage
scale_y_continuous(labels = scales::percent)+ #y-axis as percentage
labs(x = "CPI Yearly change",
y = "10-year Treasury Constant Maturity Rate",
title = "How are CPI and 10-year related?",
caption = "Data Source: FRED") + #Add titles and a caption
  theme_bw()+ #Use black-and-white theme
theme(text = element_text(size = 12),
legend.position = "none", #Remove legend
aspect.ratio = 1/12) + #Set the ratio of each faceted graph's height and length
  NULL
The task here is again to reproduce a beautiful graph. The graph shows the vaccination rates and 2020 election votes in every county of the United States, and it comes from the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common?.
We have three data sources as below:
head(vaccinations %>%
       filter(date == "09/20/2021", series_complete_12plus_pop_pct == 0))

## # A tibble: 6 × 32
## date fips mmwr_week recip_county recip_state series_complete_pop_p…
## <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 09/20/2021 48139 38 Ellis County TX 0
## 2 09/20/2021 48121 38 Denton County TX 0
## 3 09/20/2021 48227 38 Howard County TX 0
## 4 09/20/2021 48387 38 Red River County TX 0
## 5 09/20/2021 48223 38 Hopkins County TX 0
## 6 09/20/2021 48337 38 Montague County TX 0
## # … with 26 more variables: series_complete_yes <dbl>,
## # series_complete_12plus <dbl>, series_complete_12plus_pop_pct <dbl>,
## # series_complete_18plus <dbl>, series_complete_18plus_pop_pct <dbl>,
## # series_complete_65plus <dbl>, series_complete_65plus_pop_pct <dbl>,
## # completeness_pct <dbl>, administered_dose1_recip <dbl>,
## # administered_dose1_pop_pct <dbl>, administered_dose1_recip_12plus <dbl>,
## # administered_dose1_recip_12plus_pop_pct <dbl>, …
There are 277 counties, mainly in Texas, that show 0% vaccination at the latest date. Looking at them in more detail (see above), we see that the other variables are also empty. This suggests that no vaccination data is available for these counties, so we remove them from the dataset.
First we need to process the original dataset a little bit.
vaccinations_latest <- vaccinations %>%
  filter(date == "09/20/2021") %>% #keep only the most recent data entries
mutate(pct_vaccinated = case_when(
recip_state %in% c("CA", "GA", "IA", "MI", "TX") ~
administered_dose1_pop_pct, #use data for doses administered for certain states
    TRUE ~ series_complete_pop_pct)) %>%
  filter(pct_vaccinated != 0) %>% #drop observations with 0% vaccination rates
  select(fips, pct_vaccinated) #keep only the columns needed for the graph
election2020_results_Trump <- election2020_results %>% #keep each county's total Trump votes
  filter(candidate == "DONALD J TRUMP", mode == "TOTAL") %>%
  mutate(Trump_perc = candidatevotes/totalvotes) %>% #calculate share of Trump votes per county
  select(state_po, county_name, fips, candidate, Trump_perc) #keep only the variables needed later
chart_data <- election2020_results_Trump %>% #join dataframes
left_join(population,
by = "fips") %>%
left_join(vaccinations_latest) %>%
#mutate(trump_maj = ifelse(Trump_perc >= 0.5, "Yes","No")) #create variable to color graph later
mutate(pct_vaccinated = pct_vaccinated/100) %>%
distinct(fips, .keep_all = TRUE) #remove 1 duplicate
head(chart_data)

## # A tibble: 6 × 7
## state_po county_name fips candidate Trump_perc pop_estimate_2019
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 AL AUTAUGA 01001 DONALD J TRUMP 0.714 55869
## 2 AL BALDWIN 01003 DONALD J TRUMP 0.762 223234
## 3 AL BARBOUR 01005 DONALD J TRUMP 0.535 24686
## 4 AL BIBB 01007 DONALD J TRUMP 0.784 22394
## 5 AL BLOUNT 01009 DONALD J TRUMP 0.896 57826
## 6 AL BULLOCK 01011 DONALD J TRUMP 0.248 10101
## # … with 1 more variable: pct_vaccinated <dbl>
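The joins above follow a standard left-join pattern: keep every row of the left table, attach matching columns by key, and drop duplicates with `distinct()`. A minimal sketch on made-up `fips` values (hypothetical data, not the real tables):

```r
library(dplyr)

# Hypothetical county-level tables keyed by fips
votes <- tibble(fips = c("01001", "01001", "01003"),   # note the duplicated fips
                Trump_perc = c(0.71, 0.71, 0.76))
vax   <- tibble(fips = c("01001", "01003"),
                pct_vaccinated = c(40, 55))

joined <- votes %>%
  left_join(vax, by = "fips") %>%    # unmatched fips would get NA here
  distinct(fips, .keep_all = TRUE)   # keep one row per fips

joined   # 2 rows, each county with its vaccination rate attached
```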
We attempt to calculate the total percentage of people vaccinated. However, our September result is lower than the figure shown in the April graph, so we do not add the horizontal line for the actual share of people vaccinated.
#calculate %age of entire population vaccinated for horizontal line in chart
get_total_pct <- chart_data %>%
mutate(pop_vaccinated = pct_vaccinated*pop_estimate_2019) %>%
select(pop_vaccinated, pop_estimate_2019) %>%
colSums(na.rm = TRUE)
get_total_pct[1]/get_total_pct[2] #below the 50.8% shown in the April graph

## pop_vaccinated
## 0.451
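The `colSums()` calculation above is just a population-weighted mean, so base R's `weighted.mean()` gives the same figure. A minimal sketch on made-up numbers:

```r
# Hypothetical counties: vaccination share and 2019 population
toy <- data.frame(pct_vaccinated    = c(0.60, 0.40, 0.50),
                  pop_estimate_2019 = c(100, 300, 600))

# sum(pct * pop) / sum(pop), as in the colSums() version above
weighted.mean(toy$pct_vaccinated, toy$pop_estimate_2019)   # 0.48
```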
Then we can reproduce the graph.
ggplot(chart_data,
aes(x = Trump_perc,
y = pct_vaccinated,
size = pop_estimate_2019)) + #draw points based on size of population of county
geom_point(color = "darkblue",
alpha = 0.4,
na.rm = TRUE) +
scale_size_continuous(range = c(1, 40)) + #specify scale for size of population-based points
geom_point(size = 0.75) + #add points to show vaccination level per county
geom_smooth(method = "lm", #add regression line
se = FALSE,
size=0.5) +
annotate("rect", #change background color
xmin = 0,
xmax = 0.55,
ymin = -Inf,
ymax = Inf,
alpha = 0.3,
fill = "blue") +
annotate("rect", #change background color
xmin = 0.45,
xmax = 1,
ymin = -Inf,
ymax = Inf,
alpha = 0.3,
fill = "red") +
labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION PER COUNTY",
subtitle = "(most states based on FULLY vaccinated only; CA, IA, GA, MI & TX, based on total doses administered)",
y = "% of Total Population Vaccinated",
x = "2020 Trump Vote %") +
geom_hline(yintercept = 0.85, #add horizontal line
linetype = "dashed")+
annotate("text",
x = 0.12,
y = 0.865,
label = "Herd immunity threshold (?)",
size = 3,
color = "blue",
fontface = 2)+
#geom_hline(yintercept = 0.508,
# linetype = "dashed")+ #no value for actual %age added
#annotate("text",x=0.0528, y=0.534,label="ACTUAL: 50.8%",size=3, color="blue",fontface=2)+
geom_hline(yintercept = 0.539, #add horizontal line
linetype = "dashed")+
annotate("text",
x = 0.07,
y = 0.549,
label = "TARGET: 53.9%",
size = 3,
color = "blue",
fontface = 2)+
theme_bw() +
scale_x_continuous(minor_breaks = seq(0, 1, 0.05), #change gridlines
breaks = seq(0, 1.05, by = 0.05),
expand=c(0,0), limits=c(0,1.01),
labels = scales::percent_format(accuracy = 5L)) + #change scale to %
scale_y_continuous(minor_breaks = seq(0, 1, 0.05), #change gridlines
breaks = seq(0, 1.05, by = 0.05),
expand=c(0,0), limits=c(0,1.02),
labels = scales::percent_format(accuracy = 5L)) + #change scale to %
theme(legend.position = "none") + #remove legend
  NULL

The second challenge also deals with reproducing a sophisticated graph.
The Guardian newspaper has an election poll tracker for the upcoming German election, and we will reproduce the graph of this tracker.
The following code scrapes the Wikipedia page and imports the table into a dataframe.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# similar graphs and analyses can be found at
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse the HTML tables into a list of dataframes called polls
# purrr::map() applies html_table() to every table found at the URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page lists all opinion polls
  slice(2:(n()-1)) %>% # drop the first row (repeated variable names) and the last row (2017 results)
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
    # we also get the month and week number from the date, in case we want to analyse by month, week, etc.
month = month(end_date),
week = isoweek(end_date)
  )

glimpse(german_election_polls) #check dataset

## Rows: 250
## Columns: 16
## $ polling_firm <chr> "2021 federal election", "Wahlkreisprognose", "Ipsos", …
## $ fieldwork_date <chr> "26 Sep 2021", "22–24 Sep 2021", "22–23 Sep 2021", "22–…
## $ samplesize <chr> "–", "1,400", "2,000", "1,273", "2,002", "1,554", "10,0…
## $ abs <chr> "", "–", "–", "–", "26", "–", "–", "–", "–", "–", "–", …
## $ union <dbl> NA, 22.5, 22.0, 23.0, 22.0, 25.0, 23.0, 21.0, 21.5, 21.…
## $ spd <dbl> NA, 25.5, 26.0, 25.0, 25.0, 26.0, 25.0, 25.0, 25.0, 25.…
## $ af_d <dbl> NA, 11.0, 11.0, 10.0, 10.0, 10.0, 10.0, 12.0, 11.0, 11.…
## $ fdp <dbl> NA, 12.0, 12.0, 11.0, 12.0, 10.5, 12.0, 11.0, 12.5, 11.…
## $ linke <dbl> NA, 7.0, 7.0, 6.0, 6.0, 5.0, 6.0, 7.0, 6.5, 7.0, 6.5, 6…
## $ grune <dbl> NA, 14.0, 16.0, 16.5, 17.0, 16.0, 16.0, 14.0, 15.0, 16.…
## $ fw <chr> "", "–", "–", "3", "3", "–", "–", "2", "–", "–", "–", "…
## $ others <chr> "", "8", "6", "5.5", "5", "7.5", "8", "7", "8.5", "9", …
## $ lead <chr> "", "3", "4", "2", "3", "1", "2", "4", "3.5", "4", "3",…
## $ end_date <date> 2021-09-26, 2021-09-24, 2021-09-23, 2021-09-23, 2021-0…
## $ month <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ week <dbl> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 37,…
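The `str_sub()`/`dmy()` date extraction above can be checked in isolation. A minimal sketch on one example fieldwork string (the dash is the en dash used on Wikipedia):

```r
library(stringr)
library(lubridate)

fieldwork_date <- "9–13 Aug 2021"          # polls run from-to; we keep the end date
end_str  <- str_sub(fieldwork_date, -11)   # last 11 characters: "13 Aug 2021"
end_date <- dmy(end_str)                   # parsed into a Date object

end_date            # 2021-08-13
isoweek(end_date)   # 32
```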
Before plotting the graph, we first need to calculate rolling means for every party’s position in previous polls.
#Calculate the daily mean of every party's polling position
german_election_polls_mean <- german_election_polls %>%
group_by(end_date) %>%
summarise(mean_CDU = mean(union),
mean_SPD = mean(spd),
mean_Grune = mean(grune),
mean_AfD = mean(af_d),
mean_FDP = mean(fdp),
mean_Linke = mean(linke))
#Then calculate 14-day rolling mean
german_election_polls_rollingmean <- german_election_polls_mean %>%
mutate(CDU14 = zoo::rollmean(mean_CDU,
k = 14,
fill = NA,
align = "right"), #Make sure the calculation is based on 14 days BEFORE the date we examine instead of AFTER or BETWEEN
SPD14 = zoo::rollmean(mean_SPD,
k = 14,
fill = NA,
align = "right"),
Grune14 = zoo::rollmean(mean_Grune,
k = 14,
fill = NA,
align = "right"),
AfD14 = zoo::rollmean(mean_AfD,
k = 14,
fill = NA,
align = "right"),
FDP14 = zoo::rollmean(mean_FDP,
k = 14,
fill = NA,
align = "right"),
Linke14 = zoo::rollmean(mean_Linke,
k = 14,
fill = NA,
align = "right"))
#Check that everything worked
glimpse(german_election_polls_rollingmean)

## Rows: 146
## Columns: 13
## $ end_date <date> 2021-01-04, 2021-01-05, 2021-01-06, 2021-01-08, 2021-01-11…
## $ mean_CDU <dbl> 36.5, 36.0, 35.0, 36.0, 36.0, 36.0, 37.0, 35.0, 35.0, 35.2,…
## $ mean_SPD <dbl> 15.5, 15.0, 14.0, 14.0, 15.0, 15.0, 15.0, 15.0, 15.0, 14.5,…
## $ mean_Grune <dbl> 18.0, 18.0, 21.0, 20.0, 18.0, 18.0, 20.0, 20.0, 19.0, 17.2,…
## $ mean_AfD <dbl> 10.00, 10.00, 10.00, 8.00, 10.00, 10.00, 10.00, 9.00, 9.00,…
## $ mean_FDP <dbl> 6.75, 6.00, 7.00, 7.00, 7.50, 7.00, 5.00, 6.00, 7.00, 8.75,…
## $ mean_Linke <dbl> 7.75, 9.00, 7.00, 8.00, 8.00, 8.00, 8.00, 8.00, 8.00, 7.75,…
## $ CDU14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 35.9, 3…
## $ SPD14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14.9, 1…
## $ Grune14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 18.9, 1…
## $ AfD14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 9.61, 9…
## $ FDP14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.85, 6…
## $ Linke14 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.85, 7…
Rolling means for the first 13 recorded dates are NA because there is not yet a full 14-observation window behind them. This does not affect the rest of the graph, so we can ignore it.
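The leading NAs come directly from how `zoo::rollmean()` handles an incomplete window; note also that with `align = "right"` the mean covers the previous k rows (poll dates), not exactly k calendar days. A minimal sketch with k = 3:

```r
library(zoo)

x <- 1:5
rollmean(x, k = 3, fill = NA, align = "right")
# NA NA 2 3 4 -- the first k - 1 entries have no full window behind them
```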
#Assign each party a color, replicating the colors of the original graph as closely as possible
color <- c("CDU/CSU" = "black",
"SPD" = "red3",
"Grune" = "green3",
"AfD" = "steelblue4",
"FDP" = "goldenrod1",
"Linke" = "mediumorchid3")
ggplot(german_election_polls,
aes(x = end_date)) +
#Graph the point with some transparency
geom_point(aes(y = union,
color = "CDU/CSU"),
alpha = 0.3,
size = 2) +
geom_point(aes(y = spd,
color = "SPD"),
alpha = 0.3,
size = 2) +
geom_point(aes(y = grune,
color = "Grune"),
alpha = 0.3,
size = 2) +
geom_point(aes(y = af_d,
color = "AfD"),
alpha = 0.3,
size = 2) +
geom_point(aes(y = fdp,
color = "FDP"),
alpha = 0.3,
size = 2) +
geom_point(aes(y = linke,
color = "Linke"),
alpha = 0.3,
size = 2) +
#Graph the line to show trend. Note we have no line for the first 13 sample points because it's based on 14-day rolling average
geom_line(data = german_election_polls_rollingmean,
aes(y = CDU14,
color = "CDU/CSU"),
size = 1.2) +
geom_line(data = german_election_polls_rollingmean,
aes(y = SPD14,
color = "SPD"),
size = 1.2) +
geom_line(data = german_election_polls_rollingmean,
aes(y = Grune14,
color = "Grune"),
size = 1.2) +
geom_line(data = german_election_polls_rollingmean,
aes(y = AfD14,
color = "AfD"),
size = 1.2) +
geom_line(data = german_election_polls_rollingmean,
aes(y = FDP14,
color = "FDP"),
size = 1.2) +
geom_line(data = german_election_polls_rollingmean,
aes(y = Linke14,
color = "Linke"),
size = 1.2) +
#Set the format of the x-axis
scale_x_date(date_breaks = "1 month",
date_labels = "%b %Y") +
  #Set the limits and break points of the y-axis
scale_y_continuous(breaks = c(5,15,25,35,45)) +
expand_limits(y = c(5,45)) +
#Set color with pre-defined series
scale_color_manual(name = "Party",
values = color) +
labs(x = "Date",
y = "Share of Votes (%)",
title = "German election poll-tracker",
subtitle = "Who will be the next German chancellor?",
caption = "Source: Wikipedia, last updated on 26 Sep 2021") +
theme(text = element_text(size = 14),
legend.position = "right",
panel.background = element_rect(fill = "white"),
panel.grid.major.x = element_blank(), #Remove the x-axis grid
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(size = 0.5, #Format the y-axis grid
color = "grey",
linetype = "dashed" )) +
  NULL

As a true test to yourself: do you understand the code you submitted, and are you able to explain it to someone else?
Yes.
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.